Precise Zero-Shot Dense Retrieval without Relevance Labels
https://arxiv.org/abs/2212.10496
it remains difficult to create effective fully zero-shot dense retrieval systems when no relevance label is available. (Abstract)
#HyDE (Hypothetical Document Embeddings)
Given a query, HyDE first zero-shot instructs an instruction-following language model (e.g. InstructGPT) to generate a hypothetical document.
Then, an unsupervised contrastively learned encoder (e.g. Contriever) encodes the document into an embedding vector.
(Contriever -> Unsupervised Dense Information Retrieval with Contrastive Learning, from Section 1)
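The two steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: `generate_hypothetical_document` is a hypothetical stand-in for the instruction-following language model call (it returns a canned answer here), and `embed` is a toy bag-of-words encoder standing in for Contriever. Only the control flow (query -> hypothetical document -> embedding -> nearest-neighbor search over the corpus) follows the description.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (assumption: stands in for Contriever)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def generate_hypothetical_document(query):
    """Hypothetical stand-in for the LLM step.

    A real system would prompt an instruction-following model here;
    this sketch returns a canned passage, or the query itself.
    """
    canned = {
        "how long do cats sleep":
            "cats sleep for about twelve to sixteen hours every day",
    }
    return canned.get(query, query)


def hyde_search(query, corpus, top_k=1):
    """Embed the hypothetical document (not the query) and rank the corpus."""
    hypo_doc = generate_hypothetical_document(query)
    hypo_vec = embed(hypo_doc)
    ranked = sorted(corpus, key=lambda d: cosine(hypo_vec, embed(d)), reverse=True)
    return ranked[:top_k]


corpus = [
    "adult cats sleep twelve to sixteen hours every day on average",
    "dogs need a long walk every day",
    "how to train a parrot to talk",
]
top = hyde_search("how long do cats sleep", corpus)
print(top)
```

The point of the design is that the search key is a document-shaped text, so the comparison in embedding space is document-to-document rather than query-to-document.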
Our experiments show that HyDE significantly outperforms the state-of-the-art unsupervised dense retriever Contriever and shows strong performance comparable to fine-tuned retrievers, across various tasks (e.g. web search, QA, fact verification) and languages (e.g. sw, ko, ja).
Which dataset is used for ja?
https://github.com/texttron/hyde
https://github.com/texttron/hyde/blob/main/approach.png?raw=true
4.1 Setup (4 Experiments)
Datasets
web search query sets (-> 4.2, Table 1)
TREC DL19 (Overview of the TREC 2019 deep learning track)
TREC DL20 (Overview of the TREC 2020 deep learning track)
Both are based on MS MARCO
diverse collection of 6 low-resource datasets (-> 4.3, Table 2)
BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models
non-English retrieval: Mr. TyDi: A Multi-lingual Benchmark for Dense Retrieval
Included in LangChain: https://twitter.com/LangChainAI/status/1605962865449598979
When using embedding-based search, might accuracy improve if the search targets and the query are put into the same format?